SYSTEM FOR CONVERTING DATA TO A MARKUP LANGUAGE

RELATED APPLICATION 

This application claims the benefit of U.S. Provisional Application Serial
Number 60/138,979, filed June 14, 1999, under 35 U.S.C. 119(e), which is
incorporated herein by reference.

TECHNICAL FIELD 

This invention concerns methods of converting electronic documents from
one format to another format, particularly methods of converting documents to a
Standard Generalized Markup Language (SGML) or an Extensible Markup
Language (XML).

BACKGROUND OF THE INVENTION 

Some electronic documents include text and annotation elements which
indicate the semantics, hierarchy, structure, or format of the documents. The
annotation elements, known as markups, within a document generally conform
to a markup language which defines a set of annotation elements. The markup
language defines which elements are required, which elements are optional,
and how annotation elements are distinguished from neighboring text.
Examples of markup languages include Standard Generalized Markup Language
(SGML), Extensible Markup Language (XML), and Hypertext Markup Language
(HTML).

Additionally, electronic documents with markups are also associated with
a document type definition (DTD). The DTD for a particular document defines
the rules and format of the document in terms of a set of declarations for a
markup language, such as SGML or XML. The DTD for the document is either
embedded in the document or resides in a separate document associated with the
document. The DTD is used in parsing the document, that is, breaking the
document into smaller chunks of data for further processing.

Conventionally, marking up documents according to a markup language
entails inputting the document into a specific custom-conversion program
designed for marking up documents in the markup language. Examples of
custom conversion programs are programs created using tools such as
Omnimark™ and Balise™. Thus, for example, marking up a document in
SGML requires use of an SGML conversion program and marking up a
document in HTML requires use of an HTML conversion program. In other
words, the conventional approach marks up documents using DTD-specific
conversion programs.

This conventional approach suffers from at least five problems. First,
because the converters are dependent on the structure of a single DTD, they
cannot be used to mark up documents according to other markup languages.
Second, the converter cannot easily adapt to changes to its corresponding DTD,
since the grammatical and semantic rules of the DTD are hard-coded into the
converter, requiring the logic of the converter to be reprogrammed. Third, the
hard-coded DTD semantics in the converter increase its size and complexity,
and thus reduce its reliability. Fourth, the dependency of the converter on a
specific DTD also reduces the reusability of its source code for other DTDs.
And fifth, conventional converters follow an all-or-nothing approach to markup,
which prevents them from outputting a document with marked and unmarked
portions. This restriction reduces the flexibility and application of the
converter.

Accordingly, there is a need in the art for better ways of marking up
documents.
SUMMARY

To address these and other problems, the inventor devised systems,
methods, and software for handling documents with different document type
definitions (DTDs). One exemplary method receives a document and an
associated DTD, generates a mapping file from the document and the DTD, with
the mapping file having one node representing each possible mapping of an
element of the DTD to a portion of the document. The exemplary method
further generates one or more paths representing possible paths from one node in
the mapping file to another node in the mapping file, scores each possible path,
and selects one of the paths based on the scores. Finally, the selected path is
converted into a language described by the DTD.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of an exemplary computer system
incorporating the invention.

FIG. 2 is a block diagram of an exemplary system for disambiguating
data according to the invention.

FIG. 3 is a flowchart of an exemplary method for disambiguating data
according to the invention.

FIG. 4 is a block diagram of an apparatus of an embodiment of the
disambiguator of the present invention.

FIG. 5 is a block diagram of an exemplary method for producing an
SGML DTD parseable file of an embodiment of the present invention.

FIG. 6 is a block diagram of an exemplary mapping file data structure
incorporating the invention.

FIG. 7 is a block diagram of an exemplary data structure for representing
candidate paths of a segment of an element of an ambiguated document in an
implementation of the invention.

FIG. 8 is a block diagram of an exemplary data structure for representing
candidate paths of two contiguous segments of elements of an ambiguated
document in accord with the invention.



DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The following detailed description, which references and incorporates the
Figures, describes and illustrates one or more specific embodiments of the
invention. These embodiments, offered not to limit but only to exemplify and
teach the invention, are shown and described in sufficient detail to enable those
skilled in the art to practice the invention. Thus, where appropriate to avoid
obscuring the invention, the description may omit certain information known to
those of skill in the art.

The detailed description is divided into five sections. The first section
describes an exemplary computer system that incorporates the invention. The
second section provides a system level overview of the invention. The third
section describes examples of methods for an embodiment of the invention. The
fourth section describes a particular SGML based implementation of the
invention. Finally, the fifth section summarizes some advantages or salient
features of the exemplary embodiment.
Exemplary Computer System Incorporating Invention
FIG. 1 is a block diagram of an exemplary computer system (or
environment) 100 incorporating the invention. The description of FIG. 1
provides an overview of computer hardware and a suitable computing
environment in conjunction with which some embodiments of the present
invention can be implemented. Embodiments of the present invention are
described in terms of a computer executing computer-executable instructions.
However, some embodiments of the present invention can be implemented
entirely in computer hardware in which the computer-executable instructions are
implemented in read-only memory. One embodiment of the invention can also
be implemented in client/server computing environments where remote devices
that are linked through a communications network perform tasks. Program
modules can be located in both local and remote memory storage devices in a
distributed computing environment.

Computer 110 includes a processor 118, commercially available from
Intel, Motorola, Cyrix and others, software 120, and a system bus 126 that
operatively couples various system components including the system memory to
the processing unit 118. For example, some embodiments implement one or
more portions of system 100 using one or more mainframe computers or servers,
such as the Sun Ultra 4000 server.

The processor 118 executes exemplary DTD-independent
document-conversion software 120. Embodiments of the present invention are
not limited to any type of computer 110. In varying embodiments, computer 110
comprises a PC-compatible computer, a MacOS-compatible computer or a
UNIX-compatible computer. The construction and operation of such computers
are well known within the art.

Furthermore, computer 110 can be communicatively connected to the
Internet 130 via a communication device 128. In one embodiment, a
communication device 128 is a modem that responds to communication drivers
to connect to the Internet via what is known in the art as a "dial-up connection."
In another embodiment, a communication device 128 is an Ethernet or similar
hardware (network) card connected to a local-area network (LAN) that itself is
connected to the Internet via what is known in the art as a "direct connection"
(e.g., T1 line, etc.).

Computer 110 can be operated using at least one operating environment
to provide a graphical user interface including a user-controllable pointer. Such
operating environments include operating systems such as versions of the
Microsoft Windows and Apple MacOS operating systems well-known in the art.
Embodiments of the present invention are not limited to any particular operating
environment, however, and the construction and use of such operating
environments are well known within the art.

The computer 110 can operate in a networked environment using logical
connections to one or more remote computers. These logical connections are
achieved by a communication device coupled to, or a part of, the computer 110.
The logical connections depicted in FIG. 1 include a local-area network (LAN)
151 and a wide-area network (WAN) 152.



Exemplary System Level Overview 
FIG. 2 shows an exemplary DTD-independent document-conversion
system 200 incorporating the present invention. System 200 includes a mapper
210 that receives a document 220 of ambiguated and/or ambiguous data. The
mapper 210 creates a mapping file (not shown) from the document 220. The
mapping file (not shown) is transmitted to the disambiguator 240. The
disambiguator 240 receives the mapping file and the document type definition
(DTD) 230. The disambiguator 240 converts the mapping file into an output file
250 that complies with the DTD 230 and/or disambiguates the mapping file in
reference to, or based on, the DTD. The disambiguator 240 eliminates the need
for specific programmatic solutions and takes the place of a DTD-specific or
DTD-dependent translator.
System 200 includes a configuration file 260 that is received by the
disambiguator 240 and that specifies predetermined settings and/or parameters
describing how the disambiguation process of the disambiguator 240 operates.
For example, one setting and/or parameter specifies the markup syntax of
the DTD 230 and the output file 250, such as Extensible Markup Language
(XML) and/or Standard Generalized Markup Language (SGML).

System 200 also includes an activity log 270 that receives from the
disambiguator 240 information that describes the activity of the conversion
process of the disambiguator 240, and records the information. The
disambiguator 240 can be embodied as computer hardware circuitry or as a
computer-readable program, or a combination of both.

System 200 also includes one or more DTDs in addition to DTD 230,
which enables selective conversion to one of a plurality of markup languages.
DTD input is selected from one of the plurality of DTDs. The indicator of which
DTD to select is in the document 220, the configuration file 260, or from a
different source.

Exemplary Methods of the Invention 
In the previous section, a system level overview of the operation of an
embodiment of the invention was described. In this section, the particular
methods performed by the server and the clients of such an embodiment are
described by reference to a series of flowcharts. Describing the methods by
reference to a flowchart enables one skilled in the art to develop such programs,
firmware, or hardware, including such instructions to carry out the methods on
suitable computerized clients or servers. In computerized clients, one or more
processors of the clients execute the instructions from computer-readable media,
and in computerized servers, one or more processors of the servers execute
instructions from computer-readable media.

FIG. 3 shows an exemplary method 300 for disambiguating data
according to the invention. Method 300 is performed by a program executing on,



WO 00/77609 PCT/US00/16482 



or performed by firmware or hardware that is a part of, a computer, such as
computer 110 in FIG. 1. Method 300 can be embodied on a computer-readable
magnetic, electronic, or optical medium comprising computer-executable 
instructions. 

Ambiguous or ambiguated data has more than one grammatical or
semantic interpretation. Ambiguated data is not parseable because the data does
not subscribe to a particular set of grammatical or semantic rules that are used in
interpreting the data during parsing. More specifically, ambiguous data is data
in which it is not certain which path in the tree structure of the DTD to follow in
parsing the document, so the data is unparseable.

Method 300 includes receiving a document and an associated document 
type definition (DTD). In varying embodiments, the document is received 
before, during or after the DTD is received. 

Method 300 includes applying markup rules in the DTD to the document
305, wherein the markup rules are defined as any programs or pattern-matching
processes that can locate the elements of the DTD. The rules must locate the
elements in the input file without context of other elements. For example, if a
rule is to locate all "<para>" elements in the input file, then, with reference to
the DTD, the "<para>" element may exist within a "<section>" element and also
within a "<chapter>" element. It is frequently necessary to locate the parent
elements first, which may in turn have parent elements.
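
A minimal sketch of context-free markup rules follows, assuming each rule is a
regular expression paired with a DTD element name. The rule patterns, the
MARKUP_RULES table, and the apply_markup_rules function are hypothetical and
are offered only to illustrate pattern-matching that locates elements without
regard to other elements.

```python
import re

# Hypothetical rules: each DTD element name is paired with a pattern that
# locates it in the raw input, independent of any other element.
MARKUP_RULES = {
    "chapter": re.compile(r"^Chapter \d+", re.MULTILINE),
    "section": re.compile(r"^\d+\.\d+ ", re.MULTILINE),
    "para": re.compile(r"^[A-Z][^\n]*\.$", re.MULTILINE),
}


def apply_markup_rules(text, rules=MARKUP_RULES):
    """Return (offset, element) pairs for every rule match, sorted by offset."""
    hits = []
    for element, pattern in rules.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), element))
    return sorted(hits)


sample = "Chapter 1\n1.1 Scope\nThis is a paragraph.\n"
locations = apply_markup_rules(sample)
```

Here each rule runs over the whole input on its own, so the "para" rule does
not need to know whether it sits inside a "section" or a "chapter"; resolving
that question is left to the later disambiguation actions.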

Next, method 300 entails creating a mapping file 310. The mapping file
contains all of the locations of the DTD elements in the input file as specified by
the markup rules. The mapping file includes one or more nodes, each node
representative of a possible mapping of an element of the document type
definition to a portion of the document. The mapping file is generated from the
document and the document type definition.

Subsequently, method 300 includes generating one or more candidate
paths from the mapping file. Each candidate path represents a possible path
from one node in the mapping file to another node in the mapping file.
Subsequently, method 300 includes receiving the candidate paths 315
that compose a segment in the mapping file. A segment is one or more
candidate paths starting with a common node and ending with a common node.
Disambiguation begins in action 315.

Next, method 300 applies a scoring methodology to the candidate paths
320. A score for each of the one or more candidate paths is determined. In one
example, determining one or more scores includes determining two or more
scores for each one of the one or more candidate paths and defining the highest
of the two or more scores as the determined score for the one of the one or
more candidate paths. DTDs are commonly structured as a hierarchical tree, the
tree representing the abstract syntax of tokens in the DTD. In another example,
determining the score(s) of the candidate path(s) includes parsing the candidate
paths against the DTD by traversing the tree structure of the DTD, comparing a
candidate path to the DTD and determining if the candidate path is a valid path
or not.

Each node in the DTD tree structure represents a DTD element. The
DTD tree structure is traversed by picking a unique path in the DTD tree
structure starting at the root node and traversing from node to node to an end
node in the tree structure. As the DTD tree structure is traversed, the DTD is
mapped to a document encoded according to the DTD. At some particular
points in the traversal, the path is ambiguous because there is no unequivocal
indication from the encoded document of which one of a number of nodes in the
DTD to traverse. Ambiguous situations are problematic because there is a lack of
certainty as to how to interpret the elements in the encoded document in
consideration of the DTD.

The highest score represents the closest match to the current position in 
the DTD. 

Applying a scoring methodology to the candidate paths includes
determining a score in reference to, or based on, compliance with the document
type definition without inferring additional tags for each of the one or more
candidate paths. More specifically, this action scores a path highly if the path is
directly acceptable to the DTD without inferring any additional tags. In two
examples, a high score is represented by a "TRUE" in a Boolean scale, or as a
score of 100 on a scale of 0 to 100 in which 100 is the highest score.
Conversely, a path is scored low if the path is not directly acceptable to the DTD
without inferring any additional tags. In two examples, a low score is
represented by a "FALSE" in a Boolean scale, or as a score of 0 on a scale of 0
to 100 in which 0 is the lowest score.
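
The direct-acceptability score on the 0 to 100 scale can be sketched as
follows. This is an illustration under stated assumptions, not the scoring of
the described embodiment: the DTD is reduced to a hypothetical table, TOY_DTD,
listing which elements each element may directly contain, and a candidate path
is a list of nested element names.

```python
# Hypothetical toy DTD: each element maps to what it may directly contain.
TOY_DTD = {
    "A": {"B", "C"},
    "B": {"#PCDATA"},
    "C": {"D"},
    "D": {"#PCDATA"},
}


def score_direct(path, dtd=TOY_DTD):
    """Return 100 if every step of the path is allowed as written, else 0."""
    for parent, child in zip(path, path[1:]):
        if child not in dtd.get(parent, set()):
            return 0  # a tag would have to be inferred, so not directly acceptable
    return 100
```

For instance, the path A, B, #PCDATA scores 100 under this toy DTD, while
A, D, #PCDATA scores 0 because D may only appear inside C.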

Alternatively, a score for each of the candidate paths is determined by
determining a score in reference to, or based on, compliance with the document
type definition with inferring tags for each of the one or more candidate paths.
More specifically, this action scores a path highly, such as a score of 99 on a
scale of 0 to 100 in which 100 is the highest score, if the path is acceptable to the
DTD with tag inference in reference to, or based on, the rules of the markup
language of the DTD, such as SGML. Conversely, a path is scored low, such as a
score of 0 on a scale of 0 to 100 in which 0 is the lowest score, if the path is not
acceptable to the DTD with tag inference in reference to, or based on, the rules
of the markup language of the DTD, such as SGML.
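
Scoring with tag inference can be sketched in a similarly simplified way. The
sketch below assumes that a tag may be inferred whenever the child element is
reachable from the parent through one or more intermediate elements of a
hypothetical TOY_DTD table; the actual tag-omission rules of SGML are
considerably richer, and the function names are illustrative only.

```python
from collections import deque

# Hypothetical toy DTD: each element maps to what it may directly contain.
TOY_DTD = {
    "A": {"B", "C"},
    "B": {"#PCDATA"},
    "C": {"D"},
    "D": {"#PCDATA"},
}


def reachable(parent, child, dtd):
    """True if child can appear under parent, directly or via inferred elements."""
    seen, queue = {parent}, deque([parent])
    while queue:
        current = queue.popleft()
        for nxt in dtd.get(current, ()):
            if nxt == child:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def score_with_inference(path, dtd=TOY_DTD):
    """Return 99 if the path is acceptable once missing tags are inferred, else 0."""
    for parent, child in zip(path, path[1:]):
        if not reachable(parent, child, dtd):
            return 0
    return 99
```

Under this toy DTD the path A, D, #PCDATA scores 99, since an intermediate C
can be inferred between A and D, whereas B, D scores 0 because no chain of
elements leads from B to D.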

Alternatively, a score for each of the candidate paths is determined by
determining a score in reference to, or based on, a recursive examination of each
path for a predetermined extent from each node in the tree structure of the
mapping file for each of the one or more candidate paths. More specifically,
acceptable paths are constructed by "looking ahead" to the other nodes in the
document to determine if a particular path will lead to an acceptable path. The
look-ahead process is controlled by a parameter stored in the configuration file
that controls how far ahead an attempt to make an invalid path valid is made.
This alternative scoring process is a conventional "game tree" action in which,
for a set of N possibilities for the next progression down the tree structure
path, each progression is attempted. The scoring process is repeated to
determine if the path leads to a successful outcome.
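
The look-ahead examination can be sketched as a bounded recursive search. In
the sketch below, the depth argument stands in for the look-ahead parameter of
the configuration file, and TOY_DTD and leads_to_valid_path are hypothetical
names; the remaining positions are given as lists of alternative elements.

```python
# Hypothetical toy DTD: each element maps to what it may directly contain.
TOY_DTD = {"A": {"B", "C"}, "B": {"X"}, "C": {"Y"}, "X": set(), "Y": set()}


def leads_to_valid_path(current, remaining, depth, dtd=TOY_DTD):
    """Try each option for the next position, at most `depth` positions ahead."""
    if depth == 0 or not remaining:
        return True  # nothing inside the look-ahead window contradicts the DTD
    options, rest = remaining[0], remaining[1:]
    return any(
        option in dtd.get(current, set())
        and leads_to_valid_path(option, rest, depth - 1, dtd)
        for option in options
    )
```

With the next position offering only Y, choosing B is rejected when looking
one position ahead, because B cannot contain Y, while choosing C is accepted.
With a depth of 0 the look-ahead is effectively disabled and both choices pass.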
Thereafter, the candidate with the highest score is selected or retrieved
325. One of the candidate paths is selected based on, or from, the one or more
scores generated in action 320.

A determination is made as to whether the scores of all the candidate paths
indicate no match 330. If the determination is negative, then the method
continues at action 360. Otherwise, the determination is affirmative and method
300 continues with determining whether or not the disambiguation that began at
action 315 must be reset 335. When the determination as to whether or not the
disambiguation must be reset is affirmative, disambiguation is reset 340, in
which all data structures associated with disambiguation are reset, and the
method continues with receiving a list of candidate paths 315. When the
determination as to whether or not the disambiguation must be reset is not
affirmative, an attempt to resynchronize 345 is made, in which a determination
is made as to whether any of the candidate paths in the segment are valid, the
last open start-tag is closed, a list of valid start and end tags that can exist
in the DTD is received, and a determination is made as to whether or not any
successful matches to at least one of the candidate paths exist.

After attempting to resynchronize 345, a determination of the success of
the resynchronization is performed 350. When the determination indicates that
the resynchronization is not successful, the method continues by skipping the
current candidate paths 355 and then with receiving a list of candidate
paths 315, such as a next list of candidate paths.

When the determination 330 is that the scores of the candidate paths
indicate a match, a determination of whether or not there are multiple paths
that have equal scores to each other 360 is performed. When the determination
360 indicates that multiple paths have equal scores, then applying tie-breaking to
determine the best path 365 is performed.
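
Selection of the highest-scoring candidate, with a fallback for equal scores,
can be sketched as follows. The description does not state a particular
tie-breaking rule, so the rule used here (keep the candidate that was
generated first) is purely an assumption, as is the convention that a score
of 0 for every candidate indicates no match; select_path is a hypothetical
name.

```python
def select_path(candidates, scores):
    """Return the best-scoring candidate, or None when every score is 0."""
    best = max(scores, default=0)
    if best == 0:
        return None  # no match: the caller would reset or resynchronize
    tied = [path for path, score in zip(candidates, scores) if score == best]
    # Assumed tie-break: keep the candidate that was generated first.
    return tied[0]
```

A different tie-breaking rule, for example one preferring the path that
requires the fewest inferred tags, could be substituted at the marked line
without changing the rest of the flow.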

When the determination 350 indicates that the resynchronization is
successful, or the determination 360 indicates that there are not multiple paths
having equal scores, or the determination 360 indicates that multiple paths have
equal scores and score tie-breaking 365 has been applied, the candidate path with
the highest score is converted to SGML and/or XML (SGML/XML) and
transmitted to an output file 370. The selected candidate path is converted into a
language described by the DTD 230, which could be SGML or XML depending
upon whether the DTD 230 specifies SGML or XML. More specifically, an
element is generated that complies with the markup language described by the
DTD.

Subsequently, a determination is made of whether more candidates exist in
the mapping file 375. If the determination is that more candidate paths exist in
the mapping file, then the method continues with receiving a list of candidate
paths 315, such as a next list of candidate paths.

Where the mapping file includes more than one segment, actions 315 -
370 will be repeated for each of the additional segments beyond the first
segment, in which generating one or more candidate paths from the mapping file
will use a singular segment from the mapping file from which to generate
candidate paths. A segment is one or more candidate paths starting with a common
solid node and ending with a common terminal node.



SGML/XML Implementation 
In FIG. 4, a particular SGML and/or XML (SGML/XML)
implementation of the invention is described in conjunction with the system
overview in FIG. 2 and the method described in conjunction with FIG. 3 that is
SGML/XML related.

Embodiments of the invention are described as operating in a
multi-processing, multi-threaded operating environment on a computer, such as
computer 110 in FIG. 1.

FIG. 4 is a block diagram of an apparatus 400 of an embodiment of the 
disambiguator 240 of FIG. 2 of the present invention. 

Apparatus 400 includes a permutater 420 that generates one or more
candidate paths 430 from the mapping file 410. The permutater 420 is operatively
coupled to the mapping file 410, with each of the one or more candidate paths 430
representing a possible path from one node in the mapping file 410 to another
node in the mapping file 410.

A scorer 440 receives the one or more candidate paths 430 from the
permutater, yielding a corresponding number of one or more scores 450.

Apparatus 400 also includes a selector 460 that receives the candidate
paths 430 and the one or more scores 450, and selects the candidate path
having the highest score, yielding a selected candidate path 470.

A converter 480 converts the selected candidate path 470 into an output
file 490 encoded in the markup language described by the DTD 495, the markup
language being either SGML or XML depending upon which markup language
is described by the DTD 495. The converter 480 is operatively coupled to the
selector 460 and receives the selected candidate path 470 and also receives the
document type definition 495.
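
The cooperation of the permutater 420, scorer 440, selector 460 and converter
480 can be sketched as a chain of four small functions. The sketch is a
simplification under stated assumptions: TOY_DTD is a hypothetical containment
table, the scorer checks only direct acceptability, the converter emits
explicit end-tags without consulting the DTD, and all function names are
illustrative.

```python
from itertools import product

# Hypothetical toy DTD: each element maps to what it may directly contain.
TOY_DTD = {"A": {"B", "C"}, "B": {"#text"}, "C": set()}


def permutate(mapping):
    """Permutater: one candidate path per combination of alternatives."""
    return [list(path) for path in product(*mapping)]


def score(path, dtd):
    """Scorer: 100 if each step is directly allowed by the DTD, else 0."""
    ok = all(c in dtd.get(p, set()) for p, c in zip(path, path[1:]))
    return 100 if ok else 0


def select(paths, scores):
    """Selector: the candidate path with the highest score."""
    return paths[scores.index(max(scores))]


def convert(path, text):
    """Converter: wrap the text in the selected elements (assumed output form)."""
    elements = [name for name in path if name != "#text"]
    opening = "".join(f"<{name}>" for name in elements)
    closing = "".join(f"</{name}>" for name in reversed(elements))
    return opening + text + closing


def disambiguate(mapping, text, dtd=TOY_DTD):
    paths = permutate(mapping)
    scores = [score(path, dtd) for path in paths]
    return convert(select(paths, scores), text)


result = disambiguate([["A"], ["B", "C"], ["#text"]], "Hello World")
```

In this toy run the second position may be B or C; only B may contain text
under the toy DTD, so the path through B is selected and converted.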

FIG. 5 is a block diagram of a method 500 for producing an SGML DTD
parseable file of an embodiment of the present invention. Method 500 includes
receiving a non-SGML document 510. The method also includes receiving an
SGML DTD associated with the document 520. In varying embodiments,
receiving 510 can be performed before, during or after receiving 520. The DTD
may also be embedded in the document. Thereafter, the method includes
disambiguating the document in reference to, or based on, the SGML DTD 530.
Disambiguating 530 yields disambiguated data. Subsequently, the method
includes converting the disambiguated data into a file parseable in reference to,
or based on, the SGML DTD 540.

FIG. 6 is a block diagram of a mapping file data structure 600 on a
computer-readable medium for representing a possible mapping of one or more
elements of a document type definition to a portion of a document in an
implementation of the invention. The mapping file data structure includes one
or more segments, such as a first segment 610 and a second segment 620. A
segment is one or more candidate paths starting with a common solid node and
ending with a common terminal node.

Each of the segments includes a field storing data representing a solid
node 630. A solid node 630 is a node encoded in SGML/XML markup language
as the node originally existed in the document. The solid node 630 can represent
a SGML/XML start-tag, a SGML/XML end-tag, or the result of a markup rule
locating a single element, such as a SGML/XML start-tag or a SGML/XML
end-tag, at the position of the node in the document. The
solid node 630 contains the SGML/XML element. One example of a
SGML/XML element that a segment could comprise is a chapter element.

Furthermore, each of the segments includes a field storing data
representing a quantum node, such as quantum node 640 of segment 1 610 and
quantum node 645 of segment 2 620. A quantum node is a node that represents
multiple alternative tagging options for a single point in the document as created
by the markup rules. Quantum nodes 640 and 645 represent two or more of the
following: a SGML/XML start-tag, a SGML/XML end-tag, and a SGML/XML
full-tag. A full-tag represents the start-tag followed by the content of the
start-tag followed by the corresponding end-tag. Quantum node 640 contains one or
more SGML/XML elements, such as a section element, a part element or a topic
element, where the solid node 630 contains a chapter element. The quantum
node 645 of segment 2 620 contains a SGML/XML element.

Each of the segments also includes a field storing data representing a
terminal node, such as terminal node 650 of segment 1 610 and terminal node 655
of segment 2 620. A terminal node cannot contain further subnodes. The terminal
node 650 contains text "Hello World" and the terminal node 655 contains text
"some text."

In addition, where the mapping file data structure 600 comprises two or
more segments, such as segment 1 and segment 2, two contiguous
segments, such as segment 1 and segment 2, are joined in such a manner that the
field storing data representing a terminal node of the first segment is the field
storing data representing a solid node of the second segment.
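
One possible in-memory form of the mapping file data structure 600 is
sketched below, assuming each segment is held as a record with a solid node, a
list of quantum-node alternatives, and a terminal node. The class and field
names are hypothetical, and contiguity is checked here by comparing values
rather than by literally sharing one field between two segments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    solid: str          # single element found in the document, e.g. "A"
    quantum: List[str]  # alternative elements for one point, e.g. ["B", "C", "D"]
    terminal: str       # terminal content, e.g. "Hello World"


@dataclass
class MappingFile:
    segments: List[Segment]

    def is_contiguous(self) -> bool:
        """Check that each segment's terminal is the next segment's solid node."""
        pairs = zip(self.segments, self.segments[1:])
        return all(first.terminal == second.solid for first, second in pairs)


mapping = MappingFile([
    Segment("A", ["B", "C", "D"], "Hello World"),
    Segment("Hello World", ["X", "Y"], "some text"),
])
```

The two segments shown mirror the first segment 610 and second segment 620,
with the terminal node of the first serving as the solid node of the second.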

FIG. 7 is a block diagram of a data structure 700 on a computer-readable
medium for representing candidate paths of a segment of an element of an
ambiguated document in an implementation of the invention.

Data structure 700 is a tree structure that represents the first segment 610
in FIG. 6. Node 710 represents the solid node 630 in FIG. 6. Node 710 is linked
to node 720, node 730, and node 740, which represent each of the elements
"<B>", "<C>" and "<D>" in quantum node 640 of the first segment 610 in FIG.
6, respectively. Node 750 represents the terminal node 650 of the first segment
610 in FIG. 6.

All permutations of the various paths that can be traversed in data
structure 700 are shown in Table 1:

<A><B>Hello World
<A><C>Hello World
<A><D>Hello World

Table 1

Table 1 identifies the candidate paths that are scored and from which a
candidate path is selected. Each candidate path represents a possible path from
one node in the mapping file to another node in the mapping file data structure
700.

FIG. 8 is a block diagram of a data structure 800 on a computer-readable
medium for representing candidate paths of two contiguous segments of
elements of an ambiguated document in an implementation of the invention. Data
structure 800 is a tree structure that represents the contiguous first segment 610
and second segment 620 in FIG. 6. Node 810 represents the solid node 630 in
FIG. 6. Node 810 is linked to node 820, node 830, and node 840, which
represent each of the elements "<B>", "<C>" and "<D>" in quantum node 640 of
the first segment 610 in FIG. 6, respectively. Node 850 represents the terminal
node 650 of the first segment 610 in FIG. 6. Node 850 is linked to node 860 and
node 870, which represent each of the elements "<X>" and "<Y>" in quantum
node 645 of the second segment 620 in FIG. 6, respectively. Node 860 and node
870 are linked to node 880 which represents the terminal node 655 of the second
segment 620 in FIG. 6.
All permutations of the various paths that can be traversed in data
structure 800 are shown in Table 2:

<A><B>Hello World<X>some text
<A><C>Hello World<X>some text
<A><D>Hello World<X>some text
<A><B>Hello World<Y>some text
<A><C>Hello World<Y>some text
<A><D>Hello World<Y>some text

Table 2



Table 2 identifies the candidate paths that are scored and from which a
candidate path is selected. Each candidate path represents a possible path from
one node in the mapping file to another node in the mapping file data structure
800.
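
The permutations of Table 2 can be reproduced with a short enumeration. The
sketch below assumes each segment is given as a tuple of solid node, quantum
alternatives and terminal text; the tuple layout and the enumerate_paths name
are hypothetical. The rows come out in a different order than in Table 2, but
the set of rows is the same.

```python
from itertools import product

# Two contiguous segments: (solid node, quantum alternatives, terminal text).
segments = [
    ("<A>", ["<B>", "<C>", "<D>"], "Hello World"),
    ("Hello World", ["<X>", "<Y>"], "some text"),
]


def enumerate_paths(segs):
    """Return one string per combination of quantum-node choices."""
    quantum_options = [quantum for _, quantum, _ in segs]
    rows = []
    for choice in product(*quantum_options):
        row = segs[0][0]  # start at the solid node of the first segment
        for picked, (_, _, terminal) in zip(choice, segs):
            row += picked + terminal
        rows.append(row)
    return rows


rows = enumerate_paths(segments)
```

Passing only the first segment yields the three rows of Table 1, which
illustrates how the number of candidate paths multiplies as contiguous
segments are joined.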

Apparatus 200 and 400 can be embodied on a computer-readable
magnetic, electronic, or optical medium comprising computer-executable
instructions to perform method 300. Furthermore, data structures 600, 700 and
800 can be embodied on a computer-readable magnetic, electronic, or optical
medium.

More specifically, in the computer-readable program embodiment, the
programs can be structured in an object-orientation using an object-oriented
language such as Java, Smalltalk or C++, and the programs can be structured in a
procedural-orientation using a procedural language such as COBOL or C. The
software components communicate in any of a number of means that are
well-known to those skilled in the art, such as application program interfaces
(A.P.I.) or interprocess communication techniques such as remote procedure call
(R.P.C.), common object request broker architecture (CORBA), Component
Object Model (COM), Distributed Component Object Model (DCOM),
Distributed System Object Model (DSOM) and Remote Method Invocation
(RMI). The components execute on as few as one computer as in computer 110
in FIG. 1, or on at least as many computers as there are components.
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Conclusion 

In furtherance of the art, the inventors have devised systems, methods,
and software for generating data parseable against an arbitrary set of one or more
document type definitions (DTDs). One exemplary method entails generating a
list of possible paths of an input element of data that is not encoded according
to the DTD, determining the path that is the best fit with the DTD, and then
generating the element in the syntax of the DTD. Determining the path that is
the best fit entails parsing the path against the DTD. The best fit is expressed on
a scoring scale, in which the best score indicates the best fit. Thereafter, the path
with the best fit is translated in accordance with the DTD.
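The score-and-select step summarized above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `ALLOWED_NEXT` is a toy stand-in for a DTD, and counting permitted adjacent tag pairs is a simplified scoring rule, not the scoring described in the specification. The candidate strings are the rows of Table 2.

```python
import re

# Toy stand-in for a DTD (hypothetical): for each tag, the set of tags
# that may directly follow it.
ALLOWED_NEXT = {
    "A": {"B", "C"},
    "C": {"X"},
}

def tags(path):
    """Extract tag names, in order, from a candidate path string."""
    return re.findall(r"<(\w+)>", path)

def score(path):
    """Count adjacent tag pairs that the toy content model permits."""
    names = tags(path)
    return sum(
        1 for a, b in zip(names, names[1:])
        if b in ALLOWED_NEXT.get(a, set())
    )

def select_best(paths):
    """Return the highest-scoring path; on a tie, max() keeps the first."""
    return max(paths, key=score)

candidates = [
    "<A><B>Hello World<X>some text",
    "<A><C>Hello World<X>some text",
    "<A><D>Hello World<X>some text",
    "<A><B>Hello World<Y>some text",
    "<A><C>Hello World<Y>some text",
    "<A><D>Hello World<Y>some text",
]
best = select_best(candidates)
print(best, score(best))
```

Under this toy content model only the path through `<C>` and `<X>` has both of its tag pairs permitted, so it is the one selected for conversion.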

Although specific embodiments have been illustrated and described
herein, any arrangement which is calculated to achieve the same purpose may be
substituted for the specific embodiments shown. This application is intended to
cover any adaptations or variations of the present invention. One of ordinary
skill in the art will appreciate that the invention can be implemented in an object-
oriented design environment, a procedural design environment, or any other
design environment that provides the required relationships. In particular, one of
skill in the art will readily appreciate that the names of the methods and
apparatus are not intended to limit embodiments of the invention. Furthermore,
additional methods and apparatus can be added to the components, functions can
be rearranged among the components, and new components to correspond to
future enhancements and physical devices used in embodiments of the invention
can be introduced without departing from the scope of embodiments of the
invention. One of skill in the art will readily recognize that embodiments of the
invention are applicable to future communication devices, different file systems,
and new data types.

The embodiments described above are intended only to illustrate and
teach one or more ways of practicing or implementing the present invention, not
to restrict its breadth or scope. The actual scope of the invention, which
embraces all ways of practicing or implementing the teachings of the invention,
is defined only by the following claims and their equivalents.

CLAIMS

1. A method comprising:

receiving a document and an associated document type definition;
generating a mapping file from the document, comprising one or more
nodes, each node representative of a possible mapping of an
element of the document type definition to a portion of the
document;
generating one or more candidate paths from the mapping file, with each
candidate path representing a possible path from one node in the
mapping file to another node in the mapping file;
determining a score for each of the one or more candidate paths;
selecting one of the candidate paths based on the one or more scores; and
converting the one of the candidate paths into a language described by
the document type definition.

2. The method of claim 1, wherein the document is received after the
document type definition.

3. The method of claim 1, wherein the document type definition comprises
a Standard Generalized Markup Language and/or an Extensible Markup
Language document type definition.

4. The method of claim 1, wherein the determining a score for each one or
more candidate paths comprises:

determining a score based on compliance with the document type
definition without inferring additional tags for each of the one or
more candidate paths.

5. The method of claim 1, wherein the determining a score for each one or
more candidate paths comprises:

determining a score based on the document type definition with inferring
tags for each of the one or more candidate paths.

6. The method of claim 1, wherein the determining a score for each one or
more candidate paths comprises:

determining a score based on a recursive examination of each path for a
predetermined extent from each node in the tree structure of the
mapping file for each of the one or more candidate paths.

7. The method of claim 1, wherein determining one or more scores
comprises determining two or more scores for each one of the one or more
candidate paths and defining the highest of the two or more scores as the
determined score for the one of the one or more candidate paths.

8. The method of claim 1, wherein the document comprises a plurality of
segments; and

wherein the actions of generating one or more candidate paths, determining a
score, selecting one of the candidate paths, and converting the one
of the candidate paths are performed for each of the plurality of
segments.

9. A method comprising: 
receiving a document and an associated document type definition; 
generating a mapping file from the document and the document type 

definition, with the mapping file comprising one or more nodes, 
each node representative of a possible mapping of an element of 
the document type definition to a portion of the document; and 
disambiguating the mapping file based on the document type definition. 

10. The method of claim 9, wherein the disambiguating comprises:

generating all of the permutations of the candidate paths from the
mapping file, with each candidate path representing a possible
path from one node in the mapping file to another node in the
mapping file;
determining a score for each one or more candidate paths;
selecting one of the candidate paths based on the one or more scores; and
converting the one of the candidate paths into a language described by
the document type definition.

11. A Standard Generalized Markup Language document-type-definition
parseable file produced by a process comprising:

receiving a non-Standard Generalized Markup Language document;
receiving a Standard Generalized Markup Language document-type-
definition associated with the document;
disambiguating the document based on the Standard Generalized Markup
Language document-type-definition, yielding disambiguated data;
and
converting the disambiguated data into a file parseable based on Standard
Generalized Markup Language.

12. The Standard Generalized Markup Language document-type-definition
parseable file produced by the process of claim 11, wherein receiving a
Standard Generalized Markup Language document-type-definition
associated with the document comprises receiving a Standard
Generalized Markup Language document-type-definition in the
document.

13. A computer-readable magnetic, electronic, or optical medium comprising
computer-executable instructions for:

causing a computer to read a document and an associated document type
definition;
causing a computer to generate a mapping file based on the document
and the document type definition, with the mapping file
comprising one or more nodes, each node representative of a
possible mapping of an element of the document type definition
to a portion of the document;
causing a computer to generate one or more candidate paths from the
mapping file, with each candidate path representing a possible
path from one node in the mapping file to another node in the
mapping file;
causing a computer to determine one or more scores from the one or
more candidate paths;
causing a computer to select one of the candidate paths based on the one
or more scores; and
causing a computer to convert the one of the candidate paths into a
language described by the document type definition.

14. A system comprising:

means for receiving a document and an associated document type
definition;
means for generating a mapping file based on the document and the
document type definition, with the mapping file comprising one
or more nodes, each node representative of a possible mapping of
an element of the document type definition to a portion of the
document;
means for determining one or more scores for one or more candidate
paths, with each candidate path representing a possible path from
one node in the mapping file to another node in the mapping file;
and
means for selecting one of the candidate paths based on the one or more
scores.

15. The system of claim 13, wherein the means for receiving, the means for
generating, the means for determining, and the means for selecting exist as
respective software modules in a memory coupled to one or more computer
processors or within various parts of a mainframe computer or within a SUN
Ultra 4000 Server or within an IBM-compatible personal computer.

16. A system for transacting in electronic commerce comprising:
a processor; 

a storage device coupled to the processor; 
software means operative on the processor for disambiguating 
ambiguated data based on a document type definition. 

17. The system of claim 16, wherein the software means comprises: 
software means for generating all of the permutations of the candidate 

paths from a mapping file of the ambiguated data, with each 
candidate path representing a possible path from one node in the 
mapping file to another node in the mapping file. 

18. A computerized system comprising:
a document of ambiguous data;
a document type definition;

a mapper of ambiguous data, operatively coupled to the document and
operatively coupled to the document type definition, yielding a
mapping file from the document and the document type definition;
a disambiguator operatively coupled to the mapping file and the
document type definition, yielding an output file; and
wherein the document type definition describes a markup syntax; and
wherein the output file complies with the syntax described by the
document type definition.

19. The computerized system of claim 18, the system comprising:
a configuration file operatively coupled to the disambiguator which
specifies predetermined settings and/or parameters of the
disambiguator.

20. The computerized system of claim 18, the system comprising:
an activity log operatively coupled to the disambiguator that receives and
records information that describes the activity of the conversion
process of the disambiguator.

21. The computerized system of claim 18, the disambiguator comprising:
a permutater of one or more candidate paths from the mapping file,
operatively coupled to the mapping file, with each candidate path
representing a possible path from one node in the mapping file to
another node in the mapping file;
a scorer of the one or more candidate paths, operatively coupled to the
permutater, yielding a corresponding number of one or more
scores;
a selector of one or more candidate paths, based on the one or more
scores, operatively coupled to the scorer, yielding a selected
candidate path; and
a converter of the selected candidate path into a language described by
the document type definition, operatively coupled to the selector
of one or more candidate paths and operatively coupled to the
document type definition.

22. The computerized system of claim 21, the disambiguator comprising:
WO 00/77609 PCT/US00/16482



a selector of segments, operatively coupled to the mapping file and
operatively coupled to the permutator, that receives the mapping
file and transmits a segment of the nodes of the mapping file to
the permutator;
a comparator operatively coupled to the mapping file and the permutator
that determines the extent of remaining segments in the mapping
file.

23. The computerized system of claim 21, the scorer comprising:

a tie-breaker operatively coupled to the permutater that selects one of a plurality
of candidate paths that have equal scores.

24. A computer-readable magnetic, electronic, or optical medium
comprising:

a converter; and
a plurality of document type definitions associated with the converter.

25. A computer-readable magnetic, electronic, or optical medium
comprising:

a document;
a document type definition associated with the document; and
a converter operably coupled to the document and the document type
definition that converts the document into an output file that
complies with the document type definition.

26. A data structure stored on a computer-readable medium for representing
a possible mapping of an element of a document type definition to a portion of a
document comprising:

one or more segments.

27. The data structure of claim 26, wherein each of the one or more segments
comprises:

a field storing data representing a solid node;
a field storing data representing a quantum node; and
a field storing data representing a terminal node.

28. The data structure of claim 27, wherein the data structure comprises two
or more segments, and wherein two contiguous segments, comprising a
first segment and a second segment, are joined whereby the field storing data
representing a terminal node of the first segment further comprises the field
storing data representing a solid node of the second segment.

29. A computer data signal embodied in a carrier wave and representing a
sequence of instructions which, when executed by a processor, cause the
processor to perform:

receiving a non-Standard Generalized Markup Language document;
receiving a Standard Generalized Markup Language document-type-
definition associated with the document;
disambiguating the document based on the Standard Generalized Markup
Language document-type-definition, yielding disambiguated data;
and
converting the disambiguated data into a file parseable based on Standard
Generalized Markup Language.

30. A computer data signal embodied in a digital data stream comprising data
comprising:

a representation of a solid node;
a representation of a quantum node; and
a representation of a terminal node;
wherein the computer data signal is generated by a method comprising:
generating a mapping file from the document and the document type
definition, with the mapping file comprising one or more nodes,
each node representative of a possible mapping of an element of
the document type definition to a portion of the document.

31. A computer data signal embodied in a digital data stream comprising data 
comprising: 

a representation of a Standard Generalized Markup Language document- 
type-definition parseable file; 
wherein the computer data signal is generated by a method comprising: 
generating one or more candidate paths from a mapping file of an 

ambiguated document and a document type definition, with each 
candidate path representing a possible path from one node in the 
mapping file to another node in the mapping file; 
determining a score for each of the one or more candidate paths; 
selecting one of the candidate paths based on the one or more scores; and 
converting the one of the candidate paths into a Standard Generalized
Markup Language document-type-definition parseable file
described by the document type definition.

32. A method comprising: 
providing a disambiguator; 

disambiguating a first document based on a first DTD using the 
disambiguator; 

disambiguating a second document based on a second DTD using the
disambiguator, with the second DTD being different than the
first DTD.

33. A method of disambiguating electronic documents, comprising:
providing a set of two or more DTDs;
receiving a first document for disambiguation;
selecting at least one of the set of DTDs; and

disambiguating the first document based on the selected one of the set of
DTDs.

34. A method of disambiguating electronic documents, comprising:
receiving a first document for disambiguation, the document having first
and second portions, each portion being ambiguated;
disambiguating the first portion of the document and not the second
portion of the document; and
outputting a second document comprising the disambiguated first portion
of the document and the second portion of the first document.

35. A method comprising selectively converting to one of a plurality of
markup languages.